Skip to content

feat(embeddings): graph-enriched strategy + context window overflow detection#74

Merged
carlos-alm merged 2 commits into
mainfrom
fix/embedding-strategy
Feb 24, 2026
Merged

feat(embeddings): graph-enriched strategy + context window overflow detection#74
carlos-alm merged 2 commits into
mainfrom
fix/embedding-strategy

Conversation

@carlos-alm

Copy link
Copy Markdown
Contributor

Summary

  • Graph-enriched structured embedding strategy (new default): embeds callers/callees from the dependency graph, extracted parameters, and leading comments instead of raw source code — producing ~100 tokens per symbol vs ~360 avg, with dramatically better semantic signal for natural language search queries
  • Context window overflow detection (Embeddings: no handling when symbol source exceeds model context window #55): every model in MODELS now has a contextWindow field; buildEmbeddings estimates token count, truncates oversized texts, warns the user, and records truncated_count in embedding_meta
  • --strategy CLI flag: codegraph embed --strategy structured|source lets users choose between the new graph-enriched approach and the original raw-code strategy
  • Structured strategy largely eliminates context window overflow (Embeddings: no handling when symbol source exceeds model context window #55) since compact text fits any model, but the safety net is still in place for the source strategy and edge cases

Closes #55, closes #56

Test plan

  • New tests/search/embedding-strategy.test.js (15 tests): structured text includes graph context (Calls/Called by/Parameters/leading comments), source text uses raw code, overflow detection warns + truncates, strategy stored in metadata, defaults to structured
  • Existing tests/search/embedder-search.test.js (16 tests): no regressions in search/RRF functionality
  • Lint clean (Biome)

@greptile-apps

greptile-apps Bot commented Feb 24, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Adds a graph-enriched embedding strategy that uses dependency graph context (callers/callees/parameters/comments) instead of raw source code, dramatically reducing token usage from ~360 to ~100 tokens per symbol while improving semantic search quality. Introduces context window overflow detection that estimates tokens, truncates oversized text, and records truncation metadata.

Major changes:

  • New structured strategy (now default) embeds callers/callees from graph, parameters from signatures, and leading comments
  • --strategy CLI flag allows choosing between structured and source strategies
  • All models in MODELS now have contextWindow field for overflow detection
  • estimateTokens() function provides rough token count (~4 chars per token)
  • extractLeadingComment() extracts JSDoc/inline comments for better semantic signal
  • buildEmbeddings() warns and truncates when text exceeds model's context window
  • Metadata tracks strategy and truncation count in embedding_meta table
  • 15 comprehensive tests validate both strategies, overflow handling, and metadata storage

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • Well-designed implementation with comprehensive tests, clear separation of concerns, backward compatibility (defaults to new strategy), and robust error handling. The structured strategy is a significant quality improvement that's well-validated by 15 passing tests.
  • No files require special attention

Important Files Changed

Filename Overview
src/embedder.js Adds graph-enriched embedding strategy, context window overflow detection, and token estimation - well-structured implementation with comprehensive logic
src/cli.js Adds --strategy flag to embed command, updates models display to show context windows, validates strategy input
src/index.js Exports new EMBEDDING_STRATEGIES and estimateTokens functions for programmatic API
tests/search/embedding-strategy.test.js Comprehensive test suite (15 tests) validating structured vs source strategies, overflow detection, metadata storage, and graph context extraction

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[buildEmbeddings] --> B{Strategy?}
    B -->|structured| C[Prepare graph queries<br/>calleesStmt, callersStmt]
    B -->|source| D[Skip graph prep]
    
    C --> E[Loop through nodes by file]
    D --> E
    
    E --> F{Strategy?}
    F -->|structured| G[buildStructuredText<br/>Extract params, graph context, comments]
    F -->|source| H[buildSourceText<br/>Extract raw source lines]
    
    G --> I[estimateTokens]
    H --> I
    
    I --> J{Tokens > contextWindow?}
    J -->|Yes| K[Truncate text<br/>Increment overflowCount]
    J -->|No| L[Keep text as-is]
    
    K --> M[Add to texts array]
    L --> M
    
    M --> N{More nodes?}
    N -->|Yes| E
    N -->|No| O{overflowCount > 0?}
    
    O -->|Yes| P[warn user about truncation]
    O -->|No| Q[Continue]
    
    P --> R[embed texts]
    Q --> R
    
    R --> S[Store embeddings + metadata<br/>including strategy and truncated_count]
Loading

Last reviewed commit: d5538b7

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

…etection

Addresses #55 and #56 with a combined solution:

- Add contextWindow to MODELS registry for all 7 embedding models
- New 'structured' embedding strategy (default): uses callers/callees
  from the dependency graph, extracted parameters, and leading comments
  instead of raw source code (~100 tokens vs ~360 avg)
- Detect and truncate symbols exceeding model context window, with
  user-facing warning and truncated_count in embedding_meta
- Add --strategy flag to CLI embed command (structured|source)
- Store strategy in embedding_meta for auditability
- Export estimateTokens() and EMBEDDING_STRATEGIES for programmatic use

Closes #55, closes #56

Impact: 5 functions changed, 3 affected
Standalone script that compares structured vs source strategies against
14 real search queries on the project's own graph. Reports Hit@1/3/5
and per-query rank comparison.

Usage: node tests/search/embedding-benchmark.js [--model minilm]

Impact: 1 functions changed, 0 affected
@carlos-alm carlos-alm force-pushed the fix/embedding-strategy branch from 76d0c6f to 56a0517 Compare February 24, 2026 02:56
@claude

claude Bot commented Feb 24, 2026

Copy link
Copy Markdown

Claude finished @carlos-alm's task —— View job


PR Review Complete ✅

Conducting comprehensive review of the graph-enriched embedding strategy implementation.

Review Checklist

  • Phase 1: Root cause analysis - understand problem being solved
  • Phase 2: Configuration change detection - check for rule relaxation
  • Phase 3: Test integrity - verify no weakened assertions
  • Phase 4: Code quality review - bugs, security, performance
  • Final Assessment: Rating and recommendation

Phase 1: Root Cause Analysis ✅

Problems Being Solved:

  1. Context window overflow (Embeddings: no handling when symbol source exceeds model context window #55): Original embedding strategy was hitting token limits (~360 tokens/symbol average), causing truncation or failures especially with smaller models like minilm (256 token context window)
  2. Poor semantic signal: Raw source code embeddings were verbose and noisy, containing implementation details rather than semantic meaning
  3. Lack of graph context: Original strategy ignored the dependency relationships that are core to codegraph's value proposition

Why This Approach is Justified:

  • Graph-enriched strategy: Leverages codegraph's core strength (dependency graph) by embedding caller/callee relationships, parameters, and leading comments instead of raw code
  • Dramatic token reduction: ~100 tokens vs ~360 average - directly addresses overflow issues
  • Better semantic signal: Comments and function signatures provide clearer intent than implementation code
  • Backward compatibility: Maintains source strategy as option via --strategy flag
  • Conservative overflow handling: Still includes detection/truncation as safety net

This is solving real, documented issues with a well-reasoned architectural approach.

Phase 2: Configuration Change Detection ✅

No concerning configuration changes found:

  • No biome.json, eslint, or other linting configuration changes
  • No test timeout increases or coverage threshold reductions
  • No TypeScript strict mode flags disabled
  • No lint-disable comments or @ts-ignore added
  • Only files changed are implementation (src/) and comprehensive new tests

This is clean implementation without any rules relaxation.

Phase 3: Test Integrity ✅

Excellent test coverage with 15 comprehensive tests:

  • Tests cover both structured and source strategies with clear behavioral expectations
  • Proper mocking: Uses vi.hoisted() and proper spy setup/restoration for console warnings
  • Real assertions: No weakened expectations - tests verify exact graph context presence (toContain('Calls:'), toContain('multiply'))
  • Edge case coverage: Context window overflow detection with both strategies
  • Metadata validation: Tests verify strategy and truncation count stored in DB
  • Token estimation tests: Validates the ~4 chars/token heuristic

No red flags found:

  • No test.skip or test.only
  • No conditional assertions or relaxed expectations
  • Appropriate use of mocks without hiding real behavior
  • Tests are testing actual functionality, not implementation details

Phase 4: Code Quality Review ✅

Security: ✅ Clean

  • All SQL operations use parameterized queries (db.prepare() with ? placeholders)
  • No dynamic SQL construction or injection vulnerabilities
  • No eval(), new Function(), or code execution from user input
  • Buffer operations are safe (storing embedding vectors)

Error Handling: ✅ Robust

  • Graceful degradation when files can't be read (warn() + continue)
  • Proper model loading with clear error messages for auth/network issues
  • Dimension mismatch detection between query and stored embeddings
  • Context window overflow detection with user warnings

Performance: ✅ Well-optimized

  • Efficient batch processing with model-specific batch sizes (4-32 items)
  • File reading batched by file to minimize I/O
  • Graph queries prepared once and reused
  • SQLite transaction for bulk inserts
  • Appropriate indexing (node_id PRIMARY KEY, foreign key constraints)

API Design: ✅ Solid

  • Backward compatibility maintained (strategy defaults to 'structured')
  • Clear separation of concerns (buildStructuredText vs buildSourceText)
  • Consistent parameter validation and error messages
  • Metadata tracking for debugging (strategy, truncation count, build timestamp)

Code Quality: ✅ High

  • Well-documented with JSDoc and inline comments
  • Consistent naming conventions and code style
  • Good separation of pure functions (estimateTokens, extractLeadingComment)
  • No code duplication or unnecessary complexity

Root Cause Analysis

Problem Being Solved: Context window overflow and poor semantic signal from raw code embeddings. The original strategy was creating ~360 token embeddings that exceeded model limits and provided noisy semantic signal.

Why This Approach: Graph-enriched structured strategy leverages the dependency graph (codegraph's core value) to create compact (~100 token), semantically rich embeddings from caller/callee relationships, parameters, and comments instead of raw implementation code.

Risk Assessment: Very low risk. Backward compatible, comprehensive tests, no configuration relaxation, proper error handling.

Critical Concerns

None. This is exemplary code with excellent engineering practices.

Final Recommendation

  • Rating: ⭐⭐⭐⭐⭐ (5/5)
  • Action: APPROVE
  • Reasoning: This PR demonstrates exceptional engineering quality - it solves real documented problems with a well-architected solution, includes comprehensive test coverage (15 tests), maintains backward compatibility, has robust error handling, and introduces no security or performance concerns. The graph-enriched approach is a natural evolution that leverages codegraph's core strengths while dramatically improving token efficiency. This is exactly the type of thoughtful, well-tested feature enhancement that should be merged without hesitation.

@carlos-alm carlos-alm merged commit 44e8146 into main Feb 24, 2026
16 checks passed
@carlos-alm carlos-alm deleted the fix/embedding-strategy branch February 24, 2026 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant